Abstract


Note: This paper is a chapter in the forthcoming Handbook of Cluster
Analysis, Hennig et al. (2015). For definitions of basic clustering methods
and some further methodology, the reader is referred to other chapters of
the Handbook. To read this version of the paper without the Handbook, some
knowledge of cluster analysis methodology is required.

The aim of this chapter is to provide a framework for all the decisions
that are required when carrying out a cluster analysis in practice. A
general attitude to clustering is outlined, which connects these decisions
closely to the clustering aims in a given application. From this point of
view, the chapter then discusses aspects of data processing such as the
choice of the representation of the objects to be clustered, dissimilarity
design, transformation and standardization of variables. Regarding the
choice of the clustering method, it is explored how different methods
correspond to different clustering aims. Then an overview of benchmarking
studies comparing different clustering methods is given, as well as an
outline of theoretical approaches to characterize desiderata for clustering by
axioms. Finally, aspects of cluster validation, i.e., the assessment of the
quality of a clustering in a given dataset, are discussed, including
finding an appropriate number of clusters, testing homogeneity, internal and
external cluster validation, assessing clustering stability and data
visualization.


1 Introduction 


Note: This paper is a chapter in the forthcoming Handbook of Cluster Analysis,
Hennig et al. (2015). For definitions of basic clustering methods and some further
methodology, the reader is referred to other chapters of the Handbook. To read this
version of the paper without the Handbook, some knowledge of cluster analysis
methodology is required.


In Hennig et al. (2015), a large number of cluster analysis methods have
been introduced, and in any situation in which a clustering is needed, the user
is faced with a potentially overwhelming number of options. The current paper
is about how the required choices can be made. Milligan (1996) listed seven
steps of a cluster analysis that require decisions, namely


1. choosing the objects to be clustered, 

2. choosing the measurements/variables, 

3. standardization of variables, 

4. choosing a (dis-)similarity measure, 

5. choosing a clustering method, 

6. determining/deciding the number of clusters, 

7. interpretation, testing, replication, cluster validation. 

I will treat all but the first one (general principles of sampling and experimental 
design apply), not sticking exactly to this order. The chapter focuses on the 
general philosophy behind the required choices, what this means in practice, 
and on some areas of research. This has to be combined with knowledge on 
clustering methods as given elsewhere in this volume. Some more discussion of
the above issues can be found in Milligan (1996) and standard cluster analysis
books such as Jain and Dubes (1988); Kaufman and Rousseeuw (1990); Gordon
(1999); Everitt et al. (2011).

The point of view taken here, previously outlined in Hennig and Liao (2013)
and also shared by other authors (von Luxburg et al. (2012)), is that there is no
such thing as a universally “best clustering method”. Different methods should
be used for different aims of clustering. The task of selecting a clustering method 
implies a proper understanding of the meaning of the data, the clustering aim 
and the available methods, so that a suitable method can be matched to what the 
application requires. Although many experienced experts in the field, including 
the authors of the books cited above, agree with this view, there is not much 
advice in the literature on how the specific requirements of the application can 
be connected with the available methods. Instead, cluster analysis methods
have often been compared on simulated data or data with known classes, in
order to find a “best” one disregarding the research context. Such comparisons
are of some use, particularly because they reveal, in some cases, that methods
may not be up to what they were supposed to do. Still, it would be more useful
to have more specific information about what kind of method is connected to 
what kind of clustering task, defined by clustering aim, required cluster concept, 
and potential structure in the data. 

The present chapter goes through the most essential steps of making the
necessary decisions for a cluster analysis. It starts in Section 2 with a
discussion of the background, relating the aims of clustering to the cluster concepts
that may be of interest in a specific situation. Section 3 looks at the data to
be clustered. Often it is useful to pre-process the data before applying a
clustering method, by defining new variables or dissimilarity measures, or by
transforming or selecting features. Such operations often have a fundamental impact on the
resulting clustering. Note that I will use the term “features” to refer to the
variables eventually used for clustering if a cluster analysis method that takes an
“objects times features” matrix as input is applied, whereas the term “variables” will be
used in a more general sense for measurements characterizing the objects used
in the clustering process, potentially later to be used as clustering features, or
for computing dissimilarity measures or new variables.

Section 4 is on comparing clustering methods. This encompasses the
decision of which method fits a certain clustering aim, measurement of the quality of
clustering methods, benchmark simulation studies, and some theoretical work
on characterizing clusterings and clustering methods. In many cases, though,
there may not be enough precise information about the clustering aim and
cluster concepts of interest, so that the user may not be able to pinpoint exactly
what method is needed. Also, it may be discovered that the clustering structure
of the data differs from what was expected in advance, and other methods
than initially considered may look promising. Section 5 is about evaluating and
comparing outcomes of clustering methods, before the chapter is concluded.


2 Clustering aims and cluster concepts

In various places in the literature it is noted that there is no generally accepted 
definition of a cluster. This is not surprising, given the many different aims for 
which clusterings are used. Here are some examples: 

• delimitation of species of plants or animals in biology, 

• medical classification of diseases, 

• discovery and segmentation of settlements and periods in archeology, 

• image segmentation and object recognition, 

• social stratification, 

• market segmentation, 

• efficient organization of data bases for search queries. 

There are also quite general tasks for which clustering is applied in many subject 
areas: 

• exploratory data analysis looking for “interesting patterns” without
prescribing any specific interpretation, potentially creating new research
questions and hypotheses,

• information reduction and structuring of sets of entities from any subject 
area for simplification, more effective communication, or more effective 
access/action such as complexity reduction for further data analysis, 

• investigating the correspondence of a clustering in specific data with other 
groupings or characteristics, either hypothesized or derived from other 
data. 


What is meant by a “cluster” may differ a lot depending on the application,
and this has strong implications for the methodological strategy. Finding an
appropriate clustering method means that the cluster definition and methodology
have to be adapted to the specific aim of clustering in the application of
interest.

A key distinction can be made between “realist” aims of clustering,
concerning the discovery of some meaningful real structure corresponding to the
clusters, and “constructive” aims, where researchers intend to split up the data
into clusters for pragmatic reasons, regardless of whether there is some
essential real difference between the resulting groups. This distinction can be roughly
connected to the choice of clustering methodology. For example, some clustering
criteria such as K-means (Hennig et al. (2015)) produce homogeneous clusters
in the sense that all observations are assigned to the closest centroid, and large
distances within clusters are heavily penalized. This is useful for a number of
constructive clustering aims. On the other hand, K-means does not pay much
attention to whether or not the clusters are clearly separated by gaps, and does
not tolerate large variance and spread of points within clusters, which can occur
in clusters that correspond to real patterns (for example objects in images).

However, the distinction between realist and constructive clustering aims
is not as clear cut as it may seem at first sight. Categorization is a very
basic human activity that is directly connected with the emergence of language.
Whenever human beings speak of real patterns, this can only refer to categories
that are aspects of human cognition and can be expressed in language, which
can be seen as a pragmatic human construct (Van Mechelen et al. (1993) review
cognitive theories of categorization with a view to connecting them to inductive
data analysis including clustering). In a related manner, researchers with realist
clustering aims should not hope that the data alone can reveal real structure;
constructive impact of the researchers is needed to decide what counts as real.

The key issue in realist clustering is how the real structure the researchers
are interested in is connected to the available data. This requires subject matter
knowledge, but it also requires decisions by the researchers. “Real structure” is
often understood as the existence of an unobserved categorical variable the
values of which define the “true” clusters. Such an idea is behind the popular use
of datasets with given true classes for benchmarking of cluster analysis
methods. But neither can it be taken for granted that the known categories are the
only existing ones that could qualify as “real clusters”, nor do such categories
necessarily correspond to data analytic clusters. For example, male/female is
certainly a meaningful categorization of human beings, but there may not even
be a significant difference between men and women regarding the results of a
certain attitude survey, let alone separated clusters corresponding to sex.
Usually the objects represented in the dataset can be partitioned into real categories
in many ways. Also, different cluster analysis methods will produce different
clusterings, which may more or less well correspond to patterns that are real
in potentially different ways. This means that in order to decide about
appropriate cluster analysis methodology, researchers need to think about what data
analytic characteristics the clusters they are aiming at are supposed to have. I
call this the “cluster concept” of interest in a specific study.

The real patterns of interest may be more or less closely connected to the
available data. For example, in biological species delimitation, the concept of
a species is often defined in terms of interbreeding (there is some controversy
about the precise definition, see Hausdorf (2011)). But interbreeding patterns
are not usually available as data. Species are nowadays usually delimited by
use of genetic data, but in the past, and also occasionally in the present in
an exploratory manner, species were seen as the source of a real grouping in
phenotype data. In any case, the researchers need some idea about how true
distinctions between species are connected to patterns in the data. Regarding
genetic data, this means that knowledge needs to be used about what kind of
similarity arises from persistent genetic exchange inside a species, and what kind
of separation arises between distinct species. There may be subgroups of
individuals in a species between which there is little actual interbreeding (because
potential interbreeding suffices for species definition), for example between
geographically separated groups, and consequently not as much genetic similarity
as one would naively expect. Furthermore there are various levels of
classification in biology, such as families and genera above and subspecies below the level
of species, so that data analytic clusters may be found at several levels, and the
researchers may need to specify more precisely how much similarity within and
separation between clusters is required for finding species.

Such knowledge needs to be reflected in the cluster analysis method to be 
chosen. For example, species may be very heterogeneous regarding geographical 
distribution and size, and therefore a clustering method that implicitly tends to 
bring forth clusters that are very homogeneous such as K-means or complete
linkage is inappropriate. 

In some cases, the data are more directly connected to the cluster
definition. In species delimitation, there may be interbreeding data, in which case
researchers can specify the requirements of a clustering more directly. This may
imply graph theoretic clustering methods and a specification of how much
connectedness is required within clusters, although such decisions can often not be
made precise because of missing information arising from sampling of
individuals, missing data etc. On the other hand, the connection between the cluster
definition and the data may be less close, as in the case of phenotype data used
for delimiting species, in which case the researchers may not have strong
information about how the clusters they are interested in are characterized in the
data, and some speculation is needed in order to decide what kind of clustering
method may produce something useful.

In many situations different groupings can be interpreted as real, depending
on the focus of the researchers. Social classes for example can be defined in
various ways. Marx made ownership of means of production the major
defining characteristic of different classes, but social classes can also be defined by
looking at patterns of communication and contact, or occupation, or education,
or wealth, or by a mixture of these (Hennig and Liao (2013)). In this case,
a major issue for data clustering is the selection of the appropriate variables
and measurements, which implicitly defines what kinds of social classes can be
found.

The example of social stratification also illustrates that there is a gradual 
transition rather than a clear cut between realist and constructive clustering 
aims. According to some views (such as the Marxist one) social classes are an 
essential and real characteristic of society, but according to other views, in many 
societies there is no clear delimitation between social classes that could justify
calling these classes “real”, despite the existence of real inequality. Social classes
can then still be used as a convenient tool for structuring the inequality in such 
societies. 

Regarding constructive clustering aims, it is obvious that researchers need 
to decide about the desired “cluster concept”, or in other words, about the 
characteristics that their clusters should have. The discussion above implies 
that this is also the case for realist clustering aims, for which the required 
cluster concept needs to be derived from knowledge about the nature of the real 
clusters, and from a decision of the researchers about their focus of interest if
(as is usually the case) the existence of more than one real clustering is
conceivable. For constructive clustering, the required cluster concept needs to
be connected to the practical use that is intended to be made of the clusters. 

Also where the primary clustering aim is constructive, realist aims may still 
be of interest insofar as if indeed some real grouping structure is clearly manifest 
in the data, many constructive aims will be served well by having this structure 
reflected in the clustering. For example, market segmentation may be useful 
regardless of whether there are really meaningfully separated groups in the data, 
but it is relevant to find them if they exist. 

Here is a list of potential characteristics of clusters that may be desired,
and that can be checked using the available data. A number of these are
related to the “formal categorization principles” listed in Section 14.2.2.1 of
Van Mechelen et al. (1993).


1. Within-cluster dissimilarities should be small. 

2. Between-cluster dissimilarities should be large. 

3. Clusters should be fitted well by certain homogeneous probability models 
such as the Gaussian or a uniform distribution on a convex set, or, if 
appropriate, by linear, time series or spatial process models. 

4. Members of a cluster should be well represented by its centroid. 

5. The dissimilarity matrix of the data should be well represented by the 
clustering (i.e., by the ultrametric induced by a dendrogram, or by defining 
a binary metric “in same cluster/in different clusters”). 

6. Clusters should be stable. 

7. Clusters should correspond to connected areas in data space with high 
density. 

8. The areas in data space corresponding to clusters should have certain 
characteristics (such as being convex or linear). 

9. It should be possible to characterize the clusters using a small number of 
variables. 

10. Clusters should correspond well to an externally given partition or values 
of one or more variables that were not used for computing the clustering. 

11. Features should be approximately independent within clusters. 

12. All clusters should have roughly the same size. 

13. The number of clusters should be low.

When trying to measure these characteristics, they have to be made more
precise, and in some cases it matters a lot how exactly they are defined. Take no.
1, for example. This may mean that all within-cluster dissimilarities should
be small without exception (i.e., the maximum should be small, as required by
complete linkage hierarchical clustering), or their average, or a high quantile of
them. These requirements may look similar at first sight but are very different
regarding the integration of outliers in clusters. Having large between-cluster
dissimilarities (no. 2) may emphasize gaps by looking at the smallest dissimilarities
between each pair of clusters, or it may rather mean that the central areas of the
clusters are well distributed in data space. As another example, stability can
refer to sampling other data from the same population, to adding “noise”, or to
comparing results from different clustering algorithms.
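
The different readings of characteristic no. 1 can be made concrete with a small sketch. The following Python code is only an illustration under stated assumptions: the function name `within_cluster_summaries`, the toy data and the choice of the median as the quantile are hypothetical and not taken from the chapter.

```python
import numpy as np

def within_cluster_summaries(d, labels, q=0.9):
    """Return {cluster: (max, mean, q-quantile)} of within-cluster dissimilarities.

    d is a symmetric (n, n) dissimilarity matrix, labels a length-n array.
    """
    d = np.asarray(d, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size < 2:
            continue  # a singleton has no within-cluster pairs
        sub = d[np.ix_(idx, idx)]
        pairs = sub[np.triu_indices(idx.size, k=1)]  # each pair counted once
        out[c.item()] = (float(pairs.max()),
                         float(pairs.mean()),
                         float(np.quantile(pairs, q)))
    return out

# Toy data (hypothetical): five points on a line, the last one an outlier
# that has been integrated into the same cluster as the others.
x = np.array([0.0, 1.0, 2.0, 3.0, 20.0])
d = np.abs(x[:, None] - x[None, :])
labels = np.array([0, 0, 0, 0, 0])
summary = within_cluster_summaries(d, labels, q=0.5)
```

In this toy example the maximum and the mean are dominated by the single outlier, whereas the median within-cluster dissimilarity is hardly affected, which is the sense in which the three requirements treat outliers differently.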

Some of these characteristics are in conflict with others in some datasets. 
Connected areas with high density may include very large distances, and may 
have undesired (e.g., non-convex or nonlinear) shapes. Representing objects by 
centroids may bring forth some clusters with little or no gap between them. 
Having clusters of roughly equal size forces outliers to be integrated in distant 
clusters, which produces large within-cluster dissimilarities. 

Deciding about such characteristics is the key to linking the clustering aim to 
an appropriate clustering method. For example, if a database of images should 
be clustered so that users can be shown a single image to represent a cluster, 
no. 4 is most important. Useful market segments need to be addressed by
non-statisticians and should therefore normally be represented by few variables, on
which dissimilarities between members should be low. Similar considerations 
can be made for realist clustering aims, see above. 

For choosing a clustering method, it is then necessary to know how the
available methods correspond to the required characteristics. Some methods optimize certain
characteristics directly (such as K-means for no. 4), and in some further cases
experience and research suggest typical behavior (K-means tends to produce
clusters of roughly equal size, whereas methods looking for high-density areas
may produce clusters of very variable size). See Section 4.1 for more comments
on specific methods. Other characteristics such as stability are not involved in
the definition of most clustering methods, but can be used to validate clusterings
and to compare clusterings from different methods.

The task of choosing a clustering method is made harder by the fact that in 
many applications more than one of the listed characteristics is relevant.
Clusterings may be used for several purposes, or desired characteristics may not be
well defined, e.g., in exploratory data analysis, or for realist clustering aims in 
cases where the connection between the interpretation of the clusters and the 
data is rather loose. Also, a misguided desire for uniqueness and objectivity 
makes many researchers reluctant to specify desired characteristics and choose 
a clustering method accordingly, because they hope that there is a universally 
optimal method that will just produce “natural” clusters. Probably for such 
reasons there is currently almost no systematic research investigating the
characteristics of methods in terms of the various cluster characteristics.


3 Data processing 

The decision about what data to use, including how to choose, transform and 
standardize variables, and if and how to compute a dissimilarity measure, is an 
important part of the methodological strategy in cluster analysis. It often has 
a major impact on the clustering result, and is sometimes more important than 
the choice of the clustering method. 


3.1 Choice of representation 


To some extent the data format restricts the choice of clustering methods; there 
are specialized methods for continuous, ordinal, categorical and mixed type 
data, dissimilarity data, graphs, time series, spatial data etc. But often data 
can be represented in different ways. For example, a collection of time series 
with 100 time points can be represented as points in 100-dimensional Euclidean 
space, but they can also be represented by autocorrelation parameters of a time 
series model fitted to them, by wavelet features or some other low dimensional 
representation, or by dissimilarity measures which may involve some alignment
or “time warping”, see Hennig et al. (2015). On the other hand, dissimilarity
data can be transformed to Euclidean data using multidimensional scaling
(MDS) techniques. This means that the researcher often can choose whether 
the objects are represented by features, dissimilarities, or in another way, for 
example by vertices in a graph. 
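
As an illustration of moving from dissimilarities to features, the following sketch implements classical (Torgerson) scaling with NumPy. This is only one of several MDS techniques and is chosen here as an assumption; the function name `classical_mds` and the toy points are hypothetical.

```python
import numpy as np

def classical_mds(d, k=2):
    """Classical (Torgerson) MDS: embed an (n, n) dissimilarity matrix in R^k."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    b = -0.5 * j @ (d ** 2) @ j              # doubly centered squared dissimilarities
    evals, evecs = np.linalg.eigh(b)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(evals[order], 0.0, None)   # negative values indicate non-Euclidean input
    return evecs[:, order] * np.sqrt(lam)

# Toy check (hypothetical data): distances between four points in the plane.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
emb = classical_mds(d, k=2)
d_emb = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
```

For dissimilarities that are exactly Euclidean, as in this toy check, the embedded points reproduce the original distances up to rotation and reflection. For general dissimilarities the embedding is only an approximation, so a subsequent feature-based clustering is based on distorted information.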

Generally, dissimilarity measures are a suitable basis for clustering if the
cluster concept is mainly based on the idea that similar objects should be grouped
together and dissimilar objects should be in different clusters. Dissimilarity
measures can be constructed for most data types. On the other hand, clusters
characterized by distributional and geometrical shapes and clusters with
potentially high within-cluster variability or skewness are found better with objects
characterized by features instead of dissimilarities.

The choice of representation should be guided by the question how objects 
qualify to belong together in the same cluster. For example, if the data are 
time series, there are various different possible concepts of “belonging together”. 
Time series may belong together if their values are similar most of the time, 
which is appropriate if the plain values play a large role in the assessment of 
similarity (for example cigarettes smoked per day in research about smoking 
behavior). A musical melody can be played at different speeds and in different 
keys, so that two musical melodies may still be assessed as similar despite pitch
values being quite different and changes in pitch happening at different times. In
other applications, such as particle detection by electrodes, the characteristics of 
a single event that happens at a certain potentially flexible time point (such as 
a value going up and then down again) may be important, and having detected 
such an event, some specific characteristics of it may represent the objects in 
the most useful manner. 
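
The melody example can be sketched in a few lines. The code below is a hypothetical illustration only: it compares raw pitch values with successive pitch differences (intervals), which makes the comparison insensitive to the key. Pitches are assumed to be coded as MIDI note numbers, and differences in speed, which would require some alignment such as time warping, are not handled.

```python
import numpy as np

def raw_distance(a, b):
    """Euclidean distance between two equally long pitch sequences."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def interval_distance(a, b):
    """Euclidean distance between successive pitch differences (intervals)."""
    return raw_distance(np.diff(a), np.diff(b))

melody = [60, 62, 64, 65, 67]           # hypothetical pitches (MIDI numbers)
transposed = [p + 5 for p in melody]    # the same tune, five semitones higher
other = [60, 60, 67, 67, 69]            # a different tune in the same range

raw_vs_transposed = raw_distance(melody, transposed)
raw_vs_other = raw_distance(melody, other)
interval_vs_transposed = interval_distance(melody, transposed)
interval_vs_other = interval_distance(melody, other)
```

On raw pitch values the transposed melody is further away from the original than the different tune is, whereas on intervals the transposed melody is at distance zero. Which representation is appropriate depends on what "belonging together" is supposed to mean in the application.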

A central issue regarding the representation is the choice of variables that are
either used as features to represent the objects or on which a dissimilarity
definition is based. Both subject matter and statistical considerations play a role here.
From a statistical point of view, a variable could be seen as uninformative if it
is strongly correlated with other variables and does not carry information
on its own as long as certain other variables are kept, or if it is not
connected to any “real” clustering characterized in the data for example by
high density regions. Furthermore, in some situations (for example using gene
expression data) the number of variables may simply be so large that cluster
analysis methods become unstable. There are various automatic methods for
variable selection in connection with clustering, see Hennig et al. (2015) for
clustering variables at the same time as observations, and Alelyani et al. (2014) for
a recent survey. Popular classical methods such as principal component analysis
(PCA) and MDS are occasionally used for constructing informative variables.
These, however, are based on objective functions (variance, stress) that do not
have any relation to clustering, and may therefore miss information that is
important for clustering. There are some projection pursuit-type methods that
aim at finding low-dimensional representations of the data particularly suitable
for clustering (Bolton and Krzanowski (2003); Tyler et al. (2009)).

It is important to realize, though, that the variables involved in clustering 
define the meaning of the clusters. Changing the variables implies changing the 
meaning. If the researchers have a clear idea about the meaning of the clusters
of interest, it is problematic to select variables in an automatic manner. For
example, Hennig and Liao (2013) were interested in socio-economic stratification,
for which information on income, savings, education and housing is essential. 
Even if for example incomes do not show any clear grouping structure, or are 
correlated strongly with another variable, this does not constitute a valid reason 
to exclude this variable for constructing a clustering that is meant to reflect a 
meaningful socio-economic partition of a population. A stratification based on 
automatically selected variables that cluster in a nicer way may be of exploratory 
interest, but does not fulfill the aim of the researchers. One could argue that 
in case of correlation between income and another variable, savings, say, the 
information from income is retained as long as savings (or a linear combination 
of them both, as would be generated by PCA) is still used as a feature for
clustering. But this is not true, because the fact that the information is shared by
two variables that in terms of their meaning are essential for the clustering aim 
is additional information that should not be lost. 

Another issue is that variables can play different roles, which has different 
implications. For example, a dataset may include spatial coordinates and other 
variables (e.g., regional data on avalanche risk, or color information in image 
segmentation). Depending on the role that the spatial information should play, 
spatial coordinates can be included in the clustering process as features together 
with the others (which implies that regional similarity will somehow be traded 
off against similarity regarding the other variables in the clustering process), 
or they could define constraints (e.g., clusters on the other variables could be 
constrained to be spatially connected), or they could be ignored for clustering, 
but could be used afterward to validate the resulting clusters or to analyze their 
spatial structure. For avalanche risk mapping, for example, one may take the
latter approach for detailed maps if spatial information is discretized and there
is enough data at each point, but one may want to impose spatial constraints if 
data is sparser or if the map needs to be coarser because it is used by decision 
makers instead of hikers. 

Often there is a good reason for not choosing the variables automatically 
from the data, but rather guided by the aim of clustering. In some cases
dimension reduction can be achieved by the definition of meaningful new indices
summarizing information in certain variables. On the other hand, automatic variable
selection may yield interesting clusterings if the aim is mainly exploratory, or 
if there is no prior information about the importance of the variables and it is 
suspected that some of them are uninformative “noise”. 


3.2 Dissimilarity definition 

In order to apply dissimilarity based methods and to measure whether a
clustering method groups similar observations together, a formal definition of
“dissimilarity” is needed (or “proximity”, which refers to either dissimilarity or
similarity, as sometimes used in the literature; their treatment is equivalent
and there are a number of transformations between dissimilarity and similarity
measures, the simplest and most popular of which probably is “dissimilarity =
maximum similarity minus similarity”). In many situations, dissimilarities
between objects cannot be measured directly, but have to be constructed from
some measurements of variables of the objects of interest. Directly measured
dissimilarities occur for example in comparative experiments in psychology and
market research.
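
The simplest transformation mentioned above, dissimilarity equals maximum similarity minus similarity, can be sketched as follows. The function name and the similarity matrix are hypothetical, and taking the maximum over the observed matrix (rather than a theoretical maximum of the similarity measure) is an assumption of this sketch.

```python
import numpy as np

def similarity_to_dissimilarity(s):
    """Apply 'dissimilarity = maximum similarity minus similarity' entrywise."""
    s = np.asarray(s, dtype=float)
    return s.max() - s  # maximum taken over the observed matrix (assumption)

# Hypothetical similarity matrix with self-similarity 1 on the diagonal.
s = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
d = similarity_to_dissimilarity(s)
```

With self-similarities equal to the maximum, the resulting matrix has a zero diagonal and inherits the symmetry of the similarity matrix.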

There is no unique “true” dissimilarity measure for any dataset; the
dissimilarity measurement has to depend on the researchers’ concept of what it means
to treat two objects as “similar”, and therefore on the clustering aim. 

Mathematically, a dissimilarity is a function $d: X^2 \mapsto \mathbb{R}$, $X$ being the object
space, so that $d(x, y) = d(y, x) \ge 0$ and $d(x, x) = 0$ for $x, y \in X$. There is some
work on asymmetric dissimilarities (Okada (2000)) and multiway dissimilarities
defined between more than two objects (Diatta (2004)). A dissimilarity fulfilling
the triangle inequality

$$d(x, y) + d(y, z) \ge d(x, z), \quad x, y, z \in X,$$

is called a “distance” or “metric”. The triangle inequality is connected to
Euclidean intuition and therefore seems to be a “natural” requirement, but in some
applications it is not appropriate. Hennig and Hausdorf (2006) argue, e.g., that
for presence-absence data of species on regions two species A and B are very
dissimilar if they are present on two small disjoint areas, but both should be treated
as similar to a species C covering a larger area that includes both A and B, if
clusters are to be interpreted as species grouped together by palaeoecological
processes.
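
The definition and the species example can be checked numerically. The sketch below is an illustration under assumptions: the text above does not say which measure is meant, so the Kulczynski dissimilarity between sets of regions is used here as an assumed example, and the helper names and the toy distribution areas are hypothetical.

```python
import itertools
import numpy as np

def is_dissimilarity(d, tol=1e-12):
    """Check symmetry, non-negativity and a zero diagonal."""
    d = np.asarray(d, dtype=float)
    return bool(np.allclose(d, d.T, atol=tol)
                and (d >= -tol).all()
                and np.allclose(np.diag(d), 0.0, atol=tol))

def triangle_violations(d, tol=1e-12):
    """List index triples (x, y, z) with d(x, y) + d(y, z) < d(x, z)."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    return [(x, y, z) for x, y, z in itertools.permutations(range(n), 3)
            if d[x, y] + d[y, z] < d[x, z] - tol]

def kulczynski(a, b):
    """Kulczynski dissimilarity between two sets of regions (assumed example measure)."""
    common = len(a & b)
    return 1.0 - 0.5 * (common / len(a) + common / len(b))

# Species A and B occupy small disjoint areas; C covers both and more.
areas = [{1}, {2}, set(range(1, 11))]
d = np.array([[kulczynski(a, b) for b in areas] for a in areas])
violations = triangle_violations(d)
```

Here A and B are maximally dissimilar to each other, but each is fairly similar to C, so going from A to B via C is "shorter" than the direct dissimilarity. The matrix is a valid dissimilarity in the sense defined above, but not a metric.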

A vast number of dissimilarity measures have been proposed, some for rather
general purposes, some for more specific applications (dissimilarities between
shapes (Veltkamp and Latecki (2006)), melodies (Müllensiefen and Frieler (2007)),
geographical species distribution areas (Hennig and Hausdorf (2006)), etc.).
Chapter 3 in Everitt et al. (2011) gives a good overview of general purpose
dissimilarities. Here are some basic considerations:


Aggregating binary variables. If two objects $x_1, x_2$ are represented by $p$
binary variables, let $a_{ij}$ be the number of variables $h = 1, \ldots, p$ on
which $x_{1h} = i$, $x_{2h} = j$, $i, j \in \{0, 1\}$. If all variables are
treated in the same way, the most straightforward dissimilarity is the simple
matching coefficient,

$$d_{SM}(x_1, x_2) = 1 - \frac{a_{00} + a_{11}}{p}.$$


However, often (e.g. in the case of geographical presence-absence data in
ecology) common presences are important, whereas common absences are
not. This is taken into account by the Jaccard dissimilarity

$$d_J(x_1, x_2) = 1 - \frac{a_{11}}{a_{11} + a_{10} + a_{01}}.$$

One can worry about whether this gives the object with more presences too
much weight in the denominator, and actually more than 30 dissimilarity
measures for such data have been proposed (Shi (1993)), prompting much
research about their characteristics and how they relate to each other
(Gower and Legendre (1986); Warrens (2008)).
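A minimal sketch of the two coefficients may help to see how differently they treat common absences. The function names, the example data and the convention for two objects without any presence are assumptions made for illustration.

```python
def match_counts(x1, x2):
    """Return (a11, a10, a01, a00) for two equal-length 0/1 sequences."""
    if len(x1) != len(x2):
        raise ValueError("objects must have the same number of variables")
    a11 = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 1)
    a10 = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 0)
    a01 = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 1)
    a00 = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 0)
    return a11, a10, a01, a00


def simple_matching(x1, x2):
    """One minus the proportion of variables on which both objects agree."""
    a11, a10, a01, a00 = match_counts(x1, x2)
    return 1.0 - (a00 + a11) / len(x1)


def jaccard(x1, x2):
    """Like simple matching, but common absences (a00) are ignored."""
    a11, a10, a01, a00 = match_counts(x1, x2)
    denom = a11 + a10 + a01
    # Assumed convention: two objects without any presence get dissimilarity 0.
    return 0.0 if denom == 0 else 1.0 - a11 / denom


# Two species recorded on ten regions; each is present on only two of them.
species_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
species_b = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(simple_matching(species_a, species_b), jaccard(species_a, species_b))
```

With many common absences, simple matching treats the two species as quite similar (0.2), whereas the Jaccard dissimilarity (2/3) only looks at the three regions on which at least one of them is present.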


Aggregating categorical variables. If there are more than two categories, 
again the most intuitive way to construct a dissimilarity measure is one 
minus the relative number of “matches”. In some applications such as 
population genetics dissimilarity should rather be a non-linear function of 
matches between genes, and it is also important to think about whether 
and in what way variables with different numbers of categories or even 
with more or less uniform distributions should be given different weights 
because some variables produce matches more easily than others. 

Aggregating continuous variables. The Minkowski ($L_q$)-distance between
two objects $x_i, x_j$ on $p$ real-valued variables $x_i = (x_{i1}, \ldots, x_{ip})$ is

$$d_{Mq}(x_i, x_j) = \sqrt[q]{\sum_{l=1}^{p} d_l(x_{il}, x_{jl})^q}, \qquad (1)$$

where $d_l(x, y) = |x - y|$. Variable weights $w_l$ can easily be incorporated
by multiplying the $d_l$ by $w_l$. Most often, the Euclidean distance $d_{M2}$ and
the Manhattan distance $d_{M1}$ are used. Using $d_{Mq}$ with larger $q$ gives the
variables with larger $d_l$ more weight, i.e., two observations are treated
as less similar if there is a very large dissimilarity on one variable and
small dissimilarities on the others than if there is about the same
medium-sized dissimilarity on all variables, whereas $d_{M1}$ gives all
variable-wise contributions implicitly the same weight (note that this does
not hold for the Euclidean distance that corresponds to physical distances and
is used as default choice in many applications).
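The effect of q can be checked with a short sketch of (1). The function name and the example points are hypothetical; weights enter by multiplying the variable-wise distances, as described above.

```python
def minkowski(x, y, q=2.0, weights=None):
    """Minkowski (L_q) distance with optional variable weights w_l."""
    if weights is None:
        weights = [1.0] * len(x)
    total = sum((w * abs(a - b)) ** q for a, b, w in zip(x, y, weights))
    return total ** (1.0 / q)


origin = [0.0, 0.0, 0.0, 0.0]
one_big = [4.0, 0.0, 0.0, 0.0]     # one large variable-wise difference
all_medium = [1.0, 1.0, 1.0, 1.0]  # the same total difference, spread evenly

for q in (1.0, 2.0, 4.0):
    print(q, minkowski(origin, one_big, q), minkowski(origin, all_medium, q))
```

Under the Manhattan distance both points are equally far from the origin (4). With q = 2 the evenly spread point is at distance 2 only, and the gap widens further for larger q.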

An alternative would be the (squared) Mahalanobis distance,

$$d_M(x_i, x_j)^2 = (x_i - x_j)^T S^{-1} (x_i - x_j), \qquad (2)$$


where $S$ is a scatter matrix such as the sample covariance matrix. This
is affine equivariant, i.e., not only rotating the data points in Euclidean
space, but also stretching them in any number of directions will not affect
the dissimilarity. It will also implicitly aggregate and therefore weight
information from strongly correlated variables down (correlation implies
that data are “stretched” in the direction of their dependence; the
consequence is that “joint information” is only used once). This is desirable
if clusters can come in all kinds of elliptical shapes. On the other hand, it
means that the weight of the variables is determined by their covariance
structure and not by their meaning, which is not always appropriate (see
the discussion about variable selection above).
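The invariance against stretching can be verified numerically. The following sketch is restricted to two variables so that the inverse of S can be written out explicitly; the data and function names are hypothetical, and S is taken to be the sample covariance matrix recomputed from the stretched data.

```python
def covariance_2d(points):
    """Sample covariance of 2-d points, returned as (s_xx, s_xy, s_yy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_sq_2d(a, b, cov):
    """Squared Mahalanobis distance (2), using the explicit 2x2 inverse of S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]
stretched = [(x, 10.0 * y) for x, y in data]  # second variable stretched by 10

d_original = mahalanobis_sq_2d(data[0], data[4], covariance_2d(data))
d_stretched = mahalanobis_sq_2d(stretched[0], stretched[4], covariance_2d(stretched))
print(d_original, d_stretched)
```

Both values agree up to rounding, whereas the squared Euclidean distance between the same two points changes from 32 to 1616 under the stretch.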


There are many further ways of constructing a dissimilarity measure from
several continuous variables, see Everitt et al. (2011), such as the Canberra
distance, which emphasizes differences close to zero. It is defined by $q = 1$
and $d_l(x, y) = \frac{|x - y|}{|x| + |y|}$ in (1). The Pearson correlation
coefficient $\rho(x, y)$ has been used to construct a dissimilarity measure
$d_P(x, y) = 1 - \rho(x, y)$ as well (other transformations are also used).
This interprets $x$ and $y$ as similar if they are positively linearly
dependent. This does not mean that their values have to be similar, but rather
the values of the variables relative to the other variables. In some
applications variables are clustered, which means that variables and objects
change their roles; if the variables are the objects to be clustered, $\rho$
in $d_P$ is a proper correlation between variables, which is a typical use of
$d_P$.
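Both measures are easy to sketch. The function names and example values are hypothetical, and the treatment of a 0/0 term in the Canberra distance is an assumed convention.

```python
import math


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over the variables."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumed convention: a 0/0 term contributes 0
            total += abs(a - b) / denom
    return total


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_dissimilarity(x, y):
    """One minus the Pearson correlation."""
    return 1.0 - pearson(x, y)


# The same absolute difference of 0.1 counts far more close to zero.
print(canberra([0.1], [0.2]), canberra([100.1], [100.2]))
# Very different values, but perfectly linearly related profiles.
print(correlation_dissimilarity([1, 2, 3, 4], [10, 20, 30, 40]))
```

The second line of output is (numerically) zero: the correlation-based dissimilarity treats the two profiles as identical although their values differ by a factor of ten.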


Aggregating ordinal variables. Ordinal variables are characterized by the
absence of metric information about the distances between two neighboring
categories. They could be treated as categorical variables, but this
would ignore available information. On the other hand, it is fairly common
practice to use plain Likert codes 1, 2, ... and then to use methods for
continuous data. Ordinality can be taken into account while still using
methods for continuous data by scoring the categories in a way that uses
the ordinal information only. Straightforward scores are obtained by ranking
(using the midrank for all objects in one category) or normal scores
(Conover (1999)), which treat the data as if there were an underlying
uniform (ranks) or Gaussian distribution (normal scores). A more
sophisticated approach is polytomous item response theory (Ostini and Nering
(2006)). Using scores that are determined by the distribution of the data
does not guarantee that they appropriately quantify the interpretative
distances between categories, and in some situations (e.g., Likert scales in
questionnaires where interviewees can see that responses are coded 1, 2, ...)
this may be reflected better by plain Likert codes. Sometimes also there is
a more complex structure in the categories that can be reflected by scoring
data in a customized way. For example, in Hennig and Liao (2013), a
“housing” variable had levels “owns”, “pays rent” and several levels such
as “shared ownership” that could be seen as lying between “owns” and
“pays rent” but could not be ordered, which could be reflected by having a
distance of 1 between “pays rent” and “owns” and 0.5 between any other
pair of categories.
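The two data-driven scorings can be sketched as follows. The function names and the example responses are hypothetical, and the particular normal score used here (the standard normal quantile of midrank/(n+1)) is one common variant, assumed for illustration.

```python
from collections import Counter
from statistics import NormalDist


def midrank_scores(values):
    """Map each ordered category to the midrank of the objects in it."""
    counts = Counter(values)
    scores, below = {}, 0
    for category in sorted(counts):
        k = counts[category]
        scores[category] = below + (k + 1) / 2  # mean of ranks below+1, ..., below+k
        below += k
    return scores


def normal_scores(values):
    """Map each category to the standard normal quantile of midrank / (n + 1)."""
    n = len(values)
    standard_normal = NormalDist()
    return {c: standard_normal.inv_cdf(r / (n + 1))
            for c, r in midrank_scores(values).items()}


# Ten responses on a four-point scale, most of them in the lowest category.
likert = [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]
print(midrank_scores(likert))
print(normal_scores(likert))
```

With plain Likert codes, neighboring categories are always one unit apart. The midranks 3.5, 7.5, 9 and 10 put categories 1 and 2 four units apart but categories 3 and 4 only one unit, purely because of how many respondents chose each category.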

Aggregating mixed-type variables and missing values. If there are
variable-wise distances $d_l$ defined, variables of mixed type can be
aggregated. A standard way of doing this is the Gower dissimilarity (Gower
(1971))

$$d_G(x_i, x_j) = \frac{\sum_{l=1}^{p} w_l \delta_{ijl} d_l(x_{il}, x_{jl})}{\sum_{l=1}^{p} w_l \delta_{ijl}},$$

where $w_l$ is a variable weight and $\delta_{ijl} = 1$ except if $x_{il}$ or
$x_{jl}$ are missing, in which case $\delta_{ijl} = 0$. This is a weighted
version of $d_{M1}$ and takes into account missing values by just leaving the
corresponding variable out and rescaling the others. Gower recommended to use
the weight $w_l$ for standardization to [0, 1]-range (see Section 3.4), but
Hennig and Liao (2013) argued that many clustering methods tend to identify
gaps in variable distributions with cluster borders, and that this implies
that $w_l$ should be used to weight binary and other “very discrete” variables
down against continuous variables, because otherwise the former would get an
unduly high influence on the clustering. $w_l$ can also be used to weight
variables up that have high subject matter importance. The Gower dissimilarity
is very general and covers most applications of dissimilarity-based clustering
to mixed-type variables. An alternative for missing values is to treat
them as a category of their own. For continuous variables one could give
missing values a constant dissimilarity to every other value. More references
are in Everitt et al. (2011).
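A minimal sketch of the Gower dissimilarity follows. The function signature, the encoding of missing values as None, and the choice of variable-wise distances (range-scaled absolute difference for numeric variables, 0/1 mismatch for categorical ones) are assumptions made for illustration.

```python
def gower(x, y, kinds, ranges, weights=None):
    """Gower dissimilarity between two objects with mixed-type variables.

    kinds[l] is "num" or "cat"; ranges[l] is the range of a numeric variable
    (ignored for categorical ones); None marks a missing value.
    """
    if weights is None:
        weights = [1.0] * len(x)
    numerator = denominator = 0.0
    for a, b, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if a is None or b is None:
            continue  # delta = 0: the variable is left out for this pair
        if kind == "num":
            d = abs(a - b) / rng
        else:
            d = 0.0 if a == b else 1.0
        numerator += w * d
        denominator += w
    if denominator == 0.0:
        raise ValueError("no variable is observed for both objects")
    return numerator / denominator


kinds = ("num", "cat", "cat")
ranges = (40.0, None, None)
person_1 = (30.0, "owns", 1)
person_2 = (50.0, "pays rent", None)  # third variable missing

print(gower(person_1, person_2, kinds, ranges))
print(gower(person_1, person_2, kinds, ranges, weights=(1.0, 0.5, 0.5)))
```

The first call averages the two observed variable-wise distances 0.5 and 1 to 0.75. Weighting the categorical variables down by 0.5 reduces the result to 2/3, because the categorical mismatch then counts for less.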


Custom-made dissimilarities for structured data. In many situations
detailed considerations regarding the subject matter will play the most
important role regarding the design of a dissimilarity measure. This is
particularly the case if the data are more structured than just a collection of
variables. Such considerations start with deciding how to represent the
objects, as discussed in Section 3.1 and illustrated by the task of time
series clustering. The next task is how to aggregate the measurements in
an appropriate way. In time series clustering, one consideration is whether
some processes that are interpreted to be similar may occur at different
and potentially varying speeds, so that flexible alignment (“dynamic time
warping”) is required, as may be the case in gesture recognition. See Liao
(2005) for further aspects of choosing dissimilarities between time series.
Key issues may differ a lot from one application to the next, so it is
difficult to present general rules. There is some research on approximating
expert judgments of similarity with functions of the available variables
(Gordon (1990); Müllensiefen and Frieler (2007)). Hennig and Hausdorf
(2006), who incorporate geographical distance information into a
dissimilarity for presence-absence data, list a number of general principles for
designing and fine-tuning dissimilarities:


• What should be the basic behavior of the dissimilarity as a function 
of the existing measurements (when decreasing/increasing etc.)? 

• What should be the relative weight of different aspects of the basic
behavior? Should some aspects be incorporated in a nonlinear
manner (see Section 3.3)?

• Construct exemplary pairs of objects for which it is clear what value 
the dissimilarity should have, or how it should compare with some 
other exemplary pairs. 

• Construct sequences of pairs of objects in which one aspect changes 
while others are held constant. 

• Whether and how could the dissimilarity measure be disturbed by
small changes in the characteristics? What behavior in these
situations would be appropriate?

• Which transformations of the variables should leave the
dissimilarities unchanged?

• Are there reasons that the dissimilarity measure should be a metric 
(or have some other particular mathematical properties)? 


3.3 Transformation of variables 


According to the same philosophy as before, effective distances (as used by a
clustering method) on the variables should reflect the “interpretative distance”
between objects, and transformations may be required to achieve this. Because
there is a large variety of clustering aims, it is difficult to give general principles
that can be applied in a straightforward manner, and the issue is best
illustrated using examples. Therefore, consider now the variable “savings amount”
in socio-economic stratification in Hennig and Liao (2013). Regarding social
stratification it makes sense to allow proportionally higher variation within high
income and/or high savings clusters; the “interpretative difference” between
incomes is rather governed by ratios than by absolute differences. In other words,
the difference between two people with yearly incomes of $2 million and $4
million, say, should in terms of social strata be treated about equally as the
difference between $20,000 and $40,000. This suggests a log transformation,
which has the positive side effect of taming some outliers in the data. Some people
indeed have zero savings, which means that the transformation should actually
be log(savings + c). The choice of c can have surprisingly strong implications on
clustering, because it tunes the size of the “gap” between persons with zero
savings and persons with small savings; in the dataset analyzed in Hennig and Liao
(2013) there were only a handful of persons with savings below $100, but more
with savings between $100 and $500. Clustering methods tend to identify
borders between clusters with gaps. A low value for c, e.g., c = 1, creates a
rather broad gap, which means that many clustering methods will isolate the
zero savings-group regardless of the values of the other variables. However, from
the point of view of socio-economic stratification, zero savings are not that
special and not essentially different from low savings below a few hundred dollars,
and therefore a larger value for c (Hennig and Liao (2013) chose c = 50) needs
to be chosen to allow methods to put such observations together in the same
cluster. The reasoning may seem to be very subjective, but actually this is
required when attention is paid to the detail, and there is no better justification
for any straightforward default choice (e.g., c = 1).
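The effect of c on the gap can be quantified with a few lines. The sketch assumes the transformation log(savings + c) with the natural logarithm; the function names and the savings values 0, 100 and 500 are picked for illustration from the ranges mentioned above.

```python
import math


def transformed(savings, c):
    """Log transformation with offset c, defined for zero savings if c > 0."""
    return math.log(savings + c)


def gap_ratio(c):
    """Transformed gap 0 -> 100 relative to the transformed gap 100 -> 500."""
    zero_to_small = transformed(100.0, c) - transformed(0.0, c)
    small_to_moderate = transformed(500.0, c) - transformed(100.0, c)
    return zero_to_small / small_to_moderate


for c in (1.0, 50.0):
    print(c, round(gap_ratio(c), 2))
```

With c = 1, the step from zero savings to $100 is almost three times as large as the step from $100 to $500 on the transformed scale. With c = 50 it is slightly smaller than that step, so zero savers are no longer set apart by a broad gap.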

It is fairly common that “interpretative distances” are nonlinear functions
of plain differences. As another example, Hennig and Hausdorf (2006) used
geographical distance information in a nonlinear way in a dissimilarity measure
for presence-absence data for biological species, because individuals can easily
travel shorter distances, whereas what goes on in regions with a long distance
between them is rather unrelated, regardless of whether this distance is, say,
2,000 or 4,000 km, the difference between which therefore should rather be
scaled down compared to differences between smaller distances.

Whether such transformations are needed depends on the clustering method.
For example, a typical distribution of savings amounts is very skew and
sometimes the skewness corresponds to the change in interpretative distances
along the range of the variable. Fitting a mixture of appropriate skew
distributions (see Hennig et al. (2015)) can then have a similar effect as
transforming the variable.


3.4 Standardization, weighting and sphering of variables 

Standardization of variables is a kind of transformation, but with a different
rationale. Instead of governing the effective distance within a variable, it
governs the relative weight of variables against each other when aggregating them.
Standardization is not needed if a clustering method or dissimilarity is used
that is invariant against affine transformations such as Gaussian mixture
models allowing for flexible covariance matrices or the Mahalanobis distance. Such
methods standardize variables internally, and the following considerations may
apply also to the question whether it is a good idea to use such a method.

Standardization of $x_1, \ldots, x_n \in \mathbb{R}^p$ is a special case of the
linear transformation

$$x_i^* = B^{-1}(x_i - \mu), \quad i = 1, \ldots, n,$$

where $B$ is an invertible $p \times p$-matrix and $\mu \in \mathbb{R}^p$.
Standardizing location by introducing $\mu$ (usually chosen as the mean vector
of the data) does not normally have an influence on clustering, but simplifies
expressions. “Standardization” refers to using a diagonal matrix of scale
statistics (see below) as $B$. For “sphering”, $B = UD^{1/2}$, where $S = UDU'$
for a scatter matrix $S$, with $U$ being the matrix of eigenvectors and $D$
being the diagonal matrix of eigenvalues.

If the clustering method is not affine invariant (for example K-means or
dissimilarity-based methods using the Euclidean distance), standardization may
have a large impact. For example, if variables are measured on different scales
and one variable has values around 1,000 and another one has values between 0
and 1, the first variable will dominate the clustering regardless of what
clustering pattern is supported by the second one. Standardization makes clustering
invariant against the scales of the variables, and sphering makes clustering
invariant against general affine linear transformations.
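The dominance of a large-scale variable can be seen on a toy dataset. The data below are made up: the first variable varies around 1,000 without structure, whereas the second one lies in [0, 1] and shows two groups (values near 0.1 and near 0.9). Standardization to unit variance is used as one possible choice of B, and the function names are hypothetical.

```python
from math import dist
from statistics import stdev


def standardize(data):
    """Divide every variable by its sample standard deviation.

    The location is not subtracted, because this does not change distances.
    """
    columns = list(zip(*data))
    scales = [stdev(col) for col in columns]
    return [tuple(v / s for v, s in zip(row, scales)) for row in data]


def nearest_neighbor(data, i):
    """Index of the object with the smallest Euclidean distance to object i."""
    others = (j for j in range(len(data)) if j != i)
    return min(others, key=lambda j: dist(data[i], data[j]))


raw = [
    (1000.0, 0.10), (1010.0, 0.90), (1015.0, 0.12),
    (990.0, 0.88), (1020.0, 0.15), (1005.0, 0.85),
]
print(nearest_neighbor(raw, 0), nearest_neighbor(standardize(raw), 0))
```

On the raw data the nearest neighbor of the first object is the last one, which belongs to the other group on the second variable. After standardization the nearest neighbor is the third object, which has a very similar value on the second variable.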

But standardization and sphering are not always desirable. The effect of
sphering is the same as the effect of using the Mahalanobis distance (2),
discussed above. If variables use the same measurement scale but have different
variances, it depends on the requirements of the application whether
standardization is desirable or not. For example, data may come from a questionnaire
where respondents were asked to rate several items on a scale between 1 and 10.
If for some items almost all respondents picked central values between 4 and 7,
this may well indicate that the respondents did not find these items very
interesting, and that therefore these items are less informative for clustering compared
with other items for which respondents made good use of the full width of
the scale. For standard clustering methods that are not affine invariant, the
variation within a variable defines its relative impact on the clustering. Leaving
the items unstandardized means that an item with little variation would have
little impact on clustering, which seems appropriate in this situation, whereas in
other applications one may want to allow the variables a standardized influence
on clustering regardless of the within-variable variation.

The most popular methods for standardization are 

• standardization to [0, 1]-range,

• standardization to unit variance, 

• standardization to a unit value of a robust variance estimator such as 
interquartile range (IQR) or median absolute deviation (MAD) from the 
median. 

As is the case for most such decisions, the standardization method occasionally
makes a substantial difference. The major difference is the treatment of
outlying values. Range standardization is vulnerable to outlying values in the sense
that an extreme outlier has the effect of squeezing together the other values on
that variable, so that any structural information in this variable apart from the
outlier will only have a very small influence on the clustering. This is avoided
by using a robust variance estimator, which can have another undesired effect.
Although outliers on a single variable will not affect other structural information
on the same variable so much, for objects for which a single variable has an
outlying value, this may dominate the information from all other variables, which
can have a big impact in situations with many variables and a moderate number
of outlying values in various variables. Variance standardization is a compromise
between the two other approaches, regarding both their disadvantages and their
advantages.
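The different treatment of an outlier by the three scale statistics can be sketched on a single made-up variable with nine regular values and one outlier. The function and variable names are hypothetical, and the MAD is used without a consistency factor.

```python
from statistics import median, stdev


def mad(values):
    """Median absolute deviation from the median (no consistency factor)."""
    m = median(values)
    return median([abs(v - m) for v in values])


def standardization_summary(values, bulk_low, bulk_high, outlier):
    """For each scale statistic, return (spread of the bulk, gap to the outlier)."""
    scales = {
        "range": max(values) - min(values),
        "sd": stdev(values),
        "mad": mad(values),
    }
    return {
        name: ((bulk_high - bulk_low) / s, (outlier - bulk_high) / s)
        for name, s in scales.items()
    }


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
summary = standardization_summary(values, bulk_low=1.0, bulk_high=9.0, outlier=100.0)
for name, (bulk, gap) in summary.items():
    print(name, round(bulk, 3), round(gap, 3))
```

After range standardization the nine regular values are squeezed into less than a tenth of a unit. After MAD standardization they keep a spread of 3.2 units, but the outlier is more than 36 units away and would dominate a distance aggregated over several variables. Standardization to unit variance lies between the two on both counts.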

If for subject matter reasons some variables are more important than others
regardless of the within-variable variation, one could reweight them by
multiplying them with constants reflecting the relative importance after having
standardized their data-driven impact.

None of the methods discussed up to here takes clustering information into
account. A problem here is that if a variable shows a clear separation between
clusters, this may introduce large variability, which may imply a large variance,
range or IQR/MAD. If variables use the same measurement units and values are
comparable, this could be an argument against standardization; if within-cluster
variation is low, range-standardization will normally be better than the other
schemes (Milligan and Cooper (1988)). The problem is, obviously, that
clustering information is not normally available a priori. Art et al. (1982) discuss a
method in which there is an initial guess, based on smallest dissimilarities, which
objects belong to the same cluster, from which then a provisional within-cluster
covariance matrix is estimated, which is used to sphere the dataset. De Soete
(1986) suggests to reweight variables in such a way that an ultrametric is
optimally approximated (Hennig et al. (2015)). These methods are compared with
classical standardization by Gnanadesikan et al. (1995).


4 Comparison of clustering methods 

Different cluster analysis methods can be compared in several different ways.
When choosing a method for a specific clustering aim, it is important to know the
characteristics of the clustering methods so that they can be matched with the
required cluster concept. This is treated in Section 4.1. Section 4.2 reviews some
existing studies comparing different clustering methods. Section 4.3 summarizes
some theoretical work on desirable properties of clustering methods.


4.1 Relating methods to clustering aims 

Following Section 2, the choice of an appropriate clustering method is strongly
dependent on the aim of clustering. Here I list some clustering methods treated
in this book, and how they relate to the list of potentially desirable cluster
characteristics given in Section 2. Completeness cannot be achieved because of
space limitations. For definitions of all listed methods, see Hennig et al. (2015).


K-means. The objective function of K-means implies that it aims primarily at
representing clusters by centroids. The squared Euclidean distance
penalizes large distances within clusters strongly, so outliers can have a strong
impact and there may be small outlying clusters, although K-means
generally rather tends to produce clusters of roughly equal size. Distances in
all directions from the center are treated in the same way and therefore
clusters tend to be spherical (K-means is equivalent to ML-estimation in
a model where clusters are modeled by spherical Gaussian distributions).
K-means emphasizes homogeneity rather than separation; it is usually
more successful regarding small within-cluster dissimilarities than
regarding finding gaps between clusters.

K-medoids is similar to K-means, but it uses unsquared dissimilarities. This
means that it may allow larger dissimilarities within clusters and is
somewhat more flexible regarding outliers and deviations from the spherical
cluster shape.
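The different role of outliers under squared and unsquared dissimilarities can be seen for a single one-dimensional cluster. This is only a sketch of the two cluster representatives, not of the full clustering algorithms; the data and function names are made up.

```python
def centroid(points):
    """Mean of the points; it minimizes the sum of squared distances."""
    return sum(points) / len(points)


def medoid(points):
    """Data point with the smallest sum of unsquared distances to all points."""
    return min(points, key=lambda c: sum(abs(c - x) for x in points))


cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(centroid(cluster), medoid(cluster))
```

The single outlier pulls the centroid to 22, far away from every observation, whereas the medoid stays at 3 within the bulk of the data.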


Hierarchical methods. A first consideration is whether a full hierarchy of
clusters is required (for example because the dissimilarity structure should
be approximated by an ultrametric) or whether using a hierarchical method
is rather a tool to find a single partition by cutting the hierarchy at some
point. If only a single partition is required, hierarchies are not as flexible
as some other algorithms for finding an in some sense optimal clustering
(this applies, e.g., to comparing Ward’s hierarchical method with good
algorithms for the K-means objective function as reviewed in Hennig et al.
(2015)). Different hierarchical methods produce quite different clusters.


Both Single and Complete Linkage are rather too extreme for many
applications, although they may be useful in a few specific cases. Single Linkage
focuses totally on separation, i.e., keeping the closest points of different
clusters apart from each other, and Complete Linkage focuses totally on
keeping the largest dissimilarity within a cluster low. Most other
hierarchical methods are a compromise between these two extremes.


Spectral clustering and graph theoretical methods. These methods are
not governed by straightforward objective functions that attempt to make
within-cluster dissimilarities small or between-cluster dissimilarities large.
Spectral clustering is connected to Single Linkage in the sense that its
“ideal” clusters theoretically correspond to connected components of a
graph. However, spectral clustering can be set up in such a way
(depending sometimes strongly on tuning decisions such as how the edge
weights are computed) that it works in a smoother and more flexible way
than Single Linkage, less vulnerable to single points “chaining” clusters.
Generally spectral clustering still can produce very flexible cluster shapes
and focuses much more on cluster separation than on within-cluster
homogeneity when applied to originally Euclidean data in the usual way, i.e.,
using a strongly concave transformation of the dissimilarities so that the
method focuses on the smallest dissimilarities, i.e., the neighborhoods of
points, whereas pairs of points with large dissimilarity can still be
connected through chains of neighborhoods.


Mixture models. The distributional assumptions for such models define
“prototype clusters”, i.e., the characteristics of the clusters the methods will
find. These characteristics can depend strongly on details. For example,
the Gaussian mixture model with fully flexible covariance matrices has a
much larger flexibility (which often comes with stability issues and may
incur quite large within-cluster dissimilarities) than a model in which
covariance matrices are assumed to be equal or spherical. Using mixtures
of t- or very skew distributions will allow observations within clusters
that are quite far away from the cluster cores. Generally, the mixture
model does not come with implicit conditions that ensure the separation
of clusters. Two Gaussian distributions can be so close to each other that
their mixture is unimodal. Still, for a large enough dataset, the BIC will
separate the two components, which is only beneficial if the clustering aim
allows to split up data subsets that seem rather homogeneous (the idea
of merging such mixture components is discussed in Hennig et al. (2015)).
This issue is also important to have in mind when fitting mixture models
to structural data; slight violations of model assumptions such as linearity
may lead to fits by more “clusters” that are not well separated, if the BIC
is used to determine the number of mixture components. Standard latent
class models for categorical data assume local independence within
clusters, which means that clusters can be interpreted in terms of the marginal
distributions of the variables, which may be useful but is also restrictive,
and allows large within-cluster dissimilarities. The comments here apply
for Bayesian approaches as well, which allow the user to “tune” the
behavior of the methods through adjustment of the prior distribution, e.g., by
penalizing methods with more clusters and parameters in a stronger way.
This can be a powerful tool for regularization, i.e., penalizing troublesome
issues such as zero variances and spurious clusters. On the other hand,
such priors may have unwanted implications. For example, the Dirichlet
prior implies that a certain non-uniform distribution of cluster sizes is
supported.
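That two Gaussian components need not form two separated clusters can be checked numerically. The sketch below counts the local maxima of an equal-weight mixture of two univariate Gaussians with common standard deviation on a grid; the function names, the grid resolution and the two example separations are assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the Gaussian distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def count_modes(mu1, mu2, sigma=1.0, steps=4001):
    """Count local maxima of the equal-weight two-component mixture on a grid."""
    lo = min(mu1, mu2) - 4.0 * sigma
    hi = max(mu1, mu2) + 4.0 * sigma
    xs = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    dens = [0.5 * normal_pdf(x, mu1, sigma) + 0.5 * normal_pdf(x, mu2, sigma)
            for x in xs]
    return sum(1 for k in range(1, steps - 1)
               if dens[k - 1] < dens[k] > dens[k + 1])


print(count_modes(0.0, 1.5))  # means 1.5 standard deviations apart
print(count_modes(0.0, 4.0))  # means 4 standard deviations apart
```

With means 1.5 standard deviations apart the mixture density has a single mode, so there is no density gap between the two components, although a model selection criterion may still prefer two components for a large enough sample.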


Clustering time series, functional data and symbolic data. As was
already discussed in Section 3.1 regarding time series and also functions
and symbolic data, a major issue to decide is in what sense the sequences
of observations should belong together in a cluster, which could mean for
example similar values, similar functional shapes (with or without
alignment or “time warping”), similar autocorrelation structure, or good
approximation by prototype objects. This is what mainly distinguishes the
many methods discussed in these chapters.


Density-based methods. Identifying clusters with areas of high density seems
to be very intuitive and directly connected to the term “cluster”. High
density areas can have very flexible shapes, but more sophisticated
density-based methods do not depend as strongly on one or a few points as Single
Linkage, which can be seen as a density-based method. There are a few
potential peculiarities to keep in mind. High density areas may vary a lot
in size, so they may include very large dissimilarities and there may be
much variation in numbers of points per cluster. In different locations in
the same dataset, depending on the local density, different density levels
may qualify as “high”, and methods looking for high density areas at
various resolutions can be useful. Clusters may also be identified with density
modes, which occur at potentially very different density levels.
Density-based methods usually do not need the number of clusters specified, but
rather their resolution, i.e., size of neighborhood (in terms of number of
neighbors or radius), grid size or kernel bandwidth. This determines how
large gaps in the density have to be in order to be found as cluster borders
and is often not easier than specifying the number of clusters. In higher
dimensions, it becomes more difficult for clustering algorithms to figure
out properly where the density is high or low, and also the sparsity of data
in high-dimensional space means that densities tend to be more uniformly
low.


4.2 Benchmarking studies 

Different clustering methods can be compared based on datasets in which a
true clustering is known. There are three basic approaches for this in the
literature (see Hennig (2015) for more discussion and some philosophical background
regarding the problem of defining the “true” clusters):


1. Real datasets can be used in which there are known classes of some kind
(a problem with this is that there is no guarantee that the known “true”
classes are the only ones that make sense, or that they even cluster
properly at all).

2. Data can be simulated from mixture or fixed partition models where
within-cluster distributions are homogeneous, such as the Gaussian or
uniform distribution (it depends on the separation of the mixture components
whether these can be seen as separated clusters; also such datasets will
naturally favor clustering methods that are based on the corresponding
model assumptions).


3. Real data can be used for which there is no knowledge of a true clustering. 


Measures as introduced in Hennig et al. (2015) such as the adjusted Rand index
can then be used in order to compare the results of clustering methods with
the true clusterings in the first two approaches. Measuring the quality of the
clusterings for the third approach is less straightforward, and this is used less
often. Morey et al. (1983), for example, used a dataset of 750 alcohol abusers on
some socio-behavioral variables, and measured quality by external validation,
i.e. looking at the discrimination of the clusters by some external variables, and
by splitting the data into two random subsamples, clustering both, and using
nearest centroid allocation for computing a similarity measure of the clustering
of the different subsamples. Another approach is to compare dissimilarity data
to the ultrametric induced by a hierarchical clustering using the cophenetic
correlation, see Hennig et al. (2015), as done by Saracli et al. (2013) for artificial
data.
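For readers without the Handbook at hand, the adjusted Rand index can be sketched in a few lines. The code uses the common pair-counting form with the expected value under random labelling with fixed cluster sizes; the function name and the handling of the degenerate case are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same n objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")
    # Number of object pairs that are together in both, in a, and in b.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:
        return 1.0  # assumed convention for degenerate cases (e.g. one cluster each)
    return (pairs_joint - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, labels swapped
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition cutting across the truth
```

The index does not depend on the cluster labels, so the first comparison gives 1. The second clustering separates every pair that the true clustering puts together, which yields a negative value (-0.5), i.e., less agreement than expected by chance.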
At first sight it seems to be a very important and promising project to
compare clustering methods comprehensively, given the variety of existing
approaches that is often confusing for the user. Unfortunately, the variety of
clustering aims and cluster concepts and also the variety of possible datasets,
both regarding data analytic features such as shape of clusters, number of
clusters, separation of clusters, outliers, noise variables, and regarding data formats
(Euclidean, ordinal, categorical variables, number of variables, structured data,
dissimilarity data of various different kinds) makes such a project a rather
unrealistic prospect.

In the 1970s and 1980s, with less methodology already existing, a number of
comparative benchmark studies were run on artificial data, usually focusing on
standard hierarchical methods and different K-means-type algorithms. Some of
these (the most comprehensive of which was Milligan (1980)) are summarized in
Milligan (1996). As could be expected, results depended heavily on the features
of the datasets. Overall, Ward’s hierarchical clustering seemed rather successful
and single linkage seemed problematic, although at least the first result may be
biased to some extent by the data generation processes used in these studies.

More recent studies tend to focus on more specialist issues such as comparing
different algorithms for the K-means criterion (Brusco and Steinley (2007)),
comparing K-means with Gaussian mixture models with more general covariance
matrix models (Steinley and Brusco (2011); note that the authors show
that often K-means does rather well even for non-spherical data, but this work
is a discussion paper and some discussants highlight situations where this is
not the case), or a latent class mixture model and K-medoids for categorical
data (Anderlucci and Hennig (2014)). Dimitriadou et al. (2004) is an example
of a study on data typical for a specific application, namely functional magnetic
resonance imaging datasets. The winners of their study are neural gas and
K-means.

A large number of comparative simulation studies can be found in papers 
that introduce new clustering methods. However, such studies are often
biased in favor of the new method that the author wants to advertise by showing
that it is superior to some existing methods. Although such studies potentially
contain interesting information about how clustering methods compare, having
their huge number and strongly varying quality in mind, the author takes the
freedom to cite as a single example Coretto and Hennig (2014), comparing robust
clustering methods on Euclidean data with elliptical clusters and outliers.

A very original approach was taken by Jain et al. (2004), who did not attempt
to rank clustering methods according to their quality. Instead, they clustered
35 different clustering algorithms into five groups based on their partitions
of twelve different datasets. The similarity between the clustering algorithms 
was measured as the averaged similarity (Rand index) between the partitions 
obtained on the datasets. Given that different clustering methods serve different 
aims and may well arrive at different legitimate clusterings on the same data, 
this seems to be a very appropriate approach. Apart from already mentioned 
methods, this study includes a number of graph based and spectral clustering 
algorithms, some methods optimizing objective functions other than K-means
(CLUTO), and “Chameleon-type” methods, i.e., more recent hierarchical algorithms
based on dynamic modeling.
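
The idea of clustering the clustering algorithms themselves can be sketched on a small scale as follows. This is not a reproduction of the cited study: four SciPy linkage methods stand in for the 35 algorithms, the datasets are simulated, and all names (`METHODS`, `rand_index`, `partitions`) are hypothetical helpers.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

METHODS = ["single", "complete", "average", "ward"]  # stand-ins for the algorithms

def rand_index(a, b):
    """Proportion of pairs of observations on which two partitions agree."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return float(np.mean(same_a == same_b))

def partitions(X, k):
    """One partition into at most k clusters per method."""
    return [fcluster(linkage(X, method=m), t=k, criterion="maxclust") for m in METHODS]

rng = np.random.default_rng(1)
# Hypothetical datasets: three groups with increasing amounts of overlap.
datasets = [rng.normal(scale=s, size=(60, 2)) + 4 * rng.integers(0, 3, size=(60, 1))
            for s in (0.5, 1.0, 1.5, 2.0)]

# Similarity between methods: Rand index averaged over the datasets.
m = len(METHODS)
S = np.zeros((m, m))
for X in datasets:
    parts = partitions(X, k=3)
    for i in range(m):
        for j in range(m):
            S[i, j] += rand_index(parts[i], parts[j])
S /= len(datasets)

# Cluster the methods, using 1 - similarity as dissimilarity.
Z = linkage(squareform(1.0 - S, checks=False), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(METHODS, groups)))
```

The resulting grouping describes which methods behave similarly on the chosen datasets; it makes no statement about which method is better.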

Still, it is fair to say that existing work merely scratches the surface of 
what could be of potential interest in cluster benchmarking, and there is much 
potential for more systematic comparison of clustering methods. 


4.3 Axioms and theoretical characteristics of clustering 
methods 

Another line of research aims at exploring whether clustering methods fulfill
some theoretical desiderata. Jardine and Sibson (1971) listed a number of supposedly
“natural” axioms for clustering methods and showed that single linkage
was the only clustering method fulfilling them. Single Linkage also fulfills eight
out of nine of the admissibility criteria given in Fisher and Van Ness (1971),
more than any other method compared there (which include standard hierarchical
methods and K-means). Together with the fact that Single Linkage is
known to be problematic in many situations because of chaining phenomena and 
the possibility to produce very large within-cluster dissimilarities, these results 
should indeed rather put into question the axiomatic approach than all methods 



d to clusterings that can be represented by ultrametrics u = F(d), such as most 
standard hierarchical clustering methods, and their monotonicity axiom requires 
d < d' => F(d) < F(d'). From the point of view of ultrametric representation 
of a distance this may look harmless, but in fact the axiom restricts the options 
for partitioning the data at the different levels of the hierarchy quite severely, 
because it implies that if d(a, b) is increased for two observations a and b that 
are in the same cluster at some level, neither a nor b nor other points in this 
cluster can be merged with points in other clusters on a lower level as a result 
of the modification.

Fisher and Van Ness (1971) use a variant of this criterion, which requires
that the resulting clustering does not change, and is therefore applicable to
procedures that do not yield ultrametrics. The implications are similarly re¬ 
strictive. They state explicitly that some admissibility criteria only make sense 
in certain applications. For example, they define “convex admissibility”, which 
states that the convex hulls of different clusters do not intersect. This requires 
the data to come from a linear space and rules out certain arrangements of non-linear
shaped clusters. It is the only criterion in Fisher and Van Ness (1971)
that is violated by single linkage. Other admissibility criteria are concerned 
with a method’s ability to recover certain “strong” clusterings, e.g., where all 
within-cluster dissimilarities are smaller than all between-cluster dissimilarities. 

More recently, there is some revived interest in the axiomatic characterization
of clustering methods. Kleinberg (2002) proved an “impossibility theorem”,
stating that there can be no partitioning method fulfilling a set of three conditions
claimed to be “natural”, namely scale invariance (multiplying all dissimilarities
with a constant does not change the partition), richness (any partition
of points is a possible outcome of the method; this particularly implies that the 
number of clusters cannot be fixed) and consistency. The latter condition states 
that if the dissimilarities are changed in such a way that all within-cluster dis¬ 
similarities are made smaller or equal, and all between-cluster dissimilarities are 
made larger or equal, the clustering remains the same. Like the monotonicity 
axioms before, this is more restrictive than the author suggests, because the 
required transformation can be defined in such a way that two or more very 
homogeneous subsets emerge within a single original cluster, which intuitively 
suggests that the original cluster should then be split up (a corresponding relaxation
of the consistency condition is proposed in the paper and does not lead to
an impossibility theorem anymore). Furthermore, Kleinberg (2002) shows that
three different versions of deciding where to cut a Single Linkage dendrogram
can fulfill any two of the three conditions, which means that these conditions
cannot be used to distinguish any other clustering approach from single linkage.
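
Two of the three conditions can be checked numerically for one way of cutting a Single Linkage dendrogram, namely cutting at a fixed number k of clusters. The following is a sketch on simulated data, not a proof. The scaling constant 3.7 and the factors 0.5 and 2.0 in the consistency transformation are arbitrary choices, and the helper names are hypothetical. Richness cannot hold for this rule, because partitions with a number of clusters other than k can never be produced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def single_linkage_k(D, k):
    """Cut the Single Linkage dendrogram of the dissimilarity matrix D at k clusters."""
    Z = linkage(squareform(D, checks=False), method="single")
    return fcluster(Z, t=k, criterion="maxclust")

def same_partition(a, b):
    """Compare two labelings up to a renaming of the cluster labels."""
    return bool(np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :]))

rng = np.random.default_rng(2)
D = squareform(pdist(rng.normal(size=(40, 2))))
k = 3
base = single_linkage_k(D, k)

# Scale invariance: multiply all dissimilarities by a constant.
scaled = single_linkage_k(3.7 * D, k)

# Consistency: shrink within-cluster and enlarge between-cluster dissimilarities.
within = base[:, None] == base[None, :]
D_cons = np.where(within, 0.5 * D, 2.0 * D)
transformed = single_linkage_k(D_cons, k)

print("scale invariant on this example:", same_partition(base, scaled))
print("consistent on this example:     ", same_partition(base, transformed))
```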

Ackerman and Ben-David (2008) respond to Kleinberg’s paper. Instead of
using the axioms to characterize clusterings, they suggest to use them (plus some
others) to characterize cluster quality functions (CQF), and then clusterings 
could be found by optimizing these functions. Note that a clustering method 
optimizing a consistent CQF (i.e., a CQF that cannot become worse under the 
kind of transformation of dissimilarities explained above) does not necessarily 
yield consistent clusterings, because in a modified dataset other clusterings could 
look even better. The idea also applies with modified axioms to clustering meth¬ 
ods with fixed number of clusters. Follow-up work studies specific properties of 
clustering methods with the aim of providing axioms that serve to distinguish
clustering methods as suitable for different applications (Ackerman et al. (2010,
2012)). A similar approach is taken by Puzicha et al. (2000), who compare a
number of clustering criteria based on separability measures averaging between-cluster
dissimilarities in different ways according to a set of axioms some of
which are very similar to the above, adding local shift invariance and robustness
criteria that formalize that small changes to single dissimilarities can only
have limited influence on the criterion.

Correa-Morris (2013) starts from Kleinberg (2002) in a different way and
allows clustering methods to be restricted by certain parameters (such as the
number of clusters). The axioms apply to clusterings as in Kleinberg (2002),
but a number of variants of the consistency requirement are defined, and several
clustering methods including Single and Complete Linkage and K-means are
shown to be scale invariant, rich and consistent in a slightly re-defined sense.

Still, much existing work on axiomatic characterization is concerned with distinguishing
“admissible” from “inadmissible” methods, exceptions being Ackerman et al.
(2010, 2012). This is of limited value in practice, particularly because up to now
no method in at least fairly widespread use has been discredited because of being
“inadmissible” in such a theoretical sense; in case of negative results, rather
the admissibility criteria were put into question. Still there is some potential
in such research to learn about the clustering methods. Changing the focus
from branding methods as generally inadmissible to distinguishing the merits 
of different approaches seems to be a more promising research direction. A 
number of other characteristics of clustering methods have been studied theoretically,
see for example the references on robustness and stability measurement
in Hennig et al. (2015).


Ackerman and Ben-David (2009) axiomatize “clusterability” of datasets with
a view towards finding computationally simpler algorithms for datasets that are
“easy” to cluster, which mainly means that there is strong separation between 
the clusters. 


5 Cluster validation 


Cluster validation is about assessing the quality of a clustering on a dataset of 
interest. Different from Section 4.2, here the focus is on analyzing a real dataset
for which the clustering is of real interest, and where no “true” clustering is
known with which the clustering to be assessed could be compared (the approaches
in Sections 5.2, 5.4 and 5.5 can also be used in benchmarking studies).
Quality assessment of a single clustering can be of interest in its own right, 
but methods for assessing the cluster quality can also be used for comparing 
different clusterings, be they from different methods, or from the same method 
but with different input parameters, particularly with different numbers of clus¬ 
ters. Because the latter is a central problem in cluster analysis, some literature 
uses the term “cluster validation” exclusively for methods to decide about the 
number of clusters, but here a more general meaning is intended. 

In any case cluster validation is an essential step in the cluster analysis 
process, particularly because most methods do not come with any indication 
of the quality of the resulting clustering other than the value of the objective 
function to be optimized, if there is one. 

There are several different approaches to cluster validation. Hennig (2005)
lists


• use of external information, 

• testing for clustering structure, 

• internal validation indices, 

• stability assessments, 

• visual exploration, 

• comparison of several different clusterings on the same dataset. 


Before going through these, I start with some considerations regarding the de¬ 
cision about the number of clusters. 


5.1 The number of clusters


Like the clustering problem as a whole, the problem of deciding the number
of clusters is not uniquely defined, and there is no unique “true” number of 
clusters. Even if the clustering method is chosen, the number of clusters is still 
ambiguous. The ideal situation for defining the problem properly seems to be 
if data are assumed to come from a mixture probability model, e.g., a mixture 
of Gaussians, and every mixture component is identified with a cluster. The 
problem then seems to boil down to estimating the number of mixture components.
To do this consistently is difficult enough (see Hennig et al. (2015)), but
unfortunately in reality it is an ill-posed problem. Generally, probability models
are not expected to hold precisely in reality. But if the data come from a distribution
that is not exactly a Gaussian mixture with finitely many components,
a consistent criterion (such as the BIC, see Hennig et al. (2015)) will estimate
a number of clusters converging to infinity, because a large dataset can be ap¬ 
proximated better with more mixture components. If mixture components are 
to be interpreted as clusters, normally at least some separation between them 
is required, which is not guaranteed if their number is estimated consistently. 
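
The behavior of the BIC on data that do not come from a finite Gaussian mixture can be explored by simulation. The sketch below is an illustration under stated assumptions: the data are one-dimensional uniform, the mixtures are fitted with scikit-learn's `GaussianMixture`, and the search is limited to at most eight components. The helper `bic_components` is hypothetical, and the numbers selected depend on the random seed and the EM initialization.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_components(x, max_k=8, seed=0):
    """Number of Gaussian mixture components with the lowest BIC, among 1..max_k."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    bics = [
        GaussianMixture(n_components=k, n_init=2, random_state=seed).fit(X).bic(X)
        for k in range(1, max_k + 1)
    ]
    return int(np.argmin(bics)) + 1

rng = np.random.default_rng(3)
# Homogeneous data without any clusters: uniform on [0, 1].
selected = {n: bic_components(rng.uniform(0.0, 1.0, size=n)) for n in (100, 1000, 10000)}
for n, k in selected.items():
    print(f"n = {n:5d}: BIC selects {k} component(s)")
```

With enough observations more than one component is selected although the data are homogeneous, so the selected components cannot be interpreted as separated clusters.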

The decision about which number of clusters is appropriate in a certain 
application amounts to deciding in some way what granularity is required for 
the clustering. Ultimately, how strong separation between different clusters is 
required and a partition into how many clusters is useful in the given situation 
cannot be decided by the data alone without user input. It is often suggested 
in the literature that the number of clusters needs to be “known” or otherwise 
it needs to be estimated from the data. But if it is understood that finding the 
number of clusters in a certain application needs user input anyway, fixing the 
number of clusters is often as legitimate a user decision as the user input needed 
otherwise. There are many supposedly “objective” criteria for finding the best
number of clusters (see Hennig et al. (2015)). But it would be more appropriate
to say that these criteria, instead of estimating any underlying “true” number 
of clusters, implicitly define what the best number of clusters is, and the user 
still needs to decide which definition is appropriate in the given application. 

In many situations there are good reasons not to fix the number of clusters 
but rather to give the data the chance to pick a number that fits its pattern. 
But the researcher should not be under the illusion that this can be done reliably
without having thought thoroughly about what cluster concept is required.
Apart from the indices listed in Hennig et al. (2015), the statistics listed
in Section 5.4 can also be used, particularly if the researcher has a quantitative idea
about, for example, how strong separation between clusters is required. 


5.2 Use of external information 

Formal and informal external information can be used. Informally, subject 
matter experts can often decide to what extent a clustering makes sense to 
them. On one hand, this is certainly not totally reliable, and a clustering that 
looks surprising to a subject matter expert may even be particularly interesting 
and could spark new discoveries. On the other hand, the subject matter expert
may have good reasons to discard a certain clustering, which often points to the 
fact that the clustering aim was not well enough specified or understood when 
choosing a certain clustering method in the first place. If possible, the problem 
should then be understood in such a way that it can lead to an amendment in 
the choice of methodology. 

For formal external validation, there may be external variables or groupings 
known that are expected or desired to be related to the clustering. For exam¬ 
ple, in market segmentation, a clustering may be computed of data that gives 
preferences of customers for certain products or brands, and in order to make 
use of these clusters, they should be to some extent homogeneous also regarding 
other features of the customers such as sex, age, household size etc. This can be 
explored using techniques such as MANOVA and discriminant analysis for continuous
variables, and association measures or tests and measures for comparing
clusterings (see Hennig et al. (2015)) for categorical variables and groupings.
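
For a categorical external variable, the association with the clustering can be examined as in the following sketch, which uses a chi-squared test and Cramér's V as one possible association measure. The data are simulated, and the helper `cluster_vs_external` and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cluster_vs_external(labels, external):
    """Contingency table, chi-squared p-value and Cramer's V between a
    clustering and an external categorical variable."""
    _, li = np.unique(labels, return_inverse=True)
    _, ei = np.unique(external, return_inverse=True)
    table = np.zeros((li.max() + 1, ei.max() + 1), dtype=int)
    np.add.at(table, (li, ei), 1)  # count the co-occurrences
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
    return table, float(p), float(v)

rng = np.random.default_rng(4)
n = 300
labels = rng.integers(0, 3, n)  # hypothetical cluster labels
# External variable related to the clusters: equal to the label in about 80% of cases.
related = np.where(rng.random(n) < 0.2, rng.integers(0, 3, n), labels)
# External variable generated independently of the clusters.
unrelated = rng.integers(0, 2, n)

table_r, p_r, v_r = cluster_vs_external(labels, related)
table_u, p_u, v_u = cluster_vs_external(labels, unrelated)
print(f"related:   p = {p_r:.3g}, Cramer's V = {v_r:.2f}")
print(f"unrelated: p = {p_u:.3g}, Cramer's V = {v_u:.2f}")
```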


5.3 Testing for clustering structure 

In many clustering applications, researchers may want to determine whether 
there is a “real” clustering in the data that corresponds to an underlying meaningful
grouping. Many clustering algorithms deliver a clustering regardless of
whether the dataset is “really” clustered. A chapter in Hennig et al. (2015) is
about methods to test homogeneity models against clustering alternatives. Note 
that straightforward models for homogeneity such as the Gaussian or uniform 
distribution may be too simple to model even some datasets without meaningful 
clusters. Significant deviations from such homogeneity models may sometimes 
be due to outliers, skew or nonlinear distributional shapes, or other structure in 
the data such as temporal or spatial autocorrelation, in which case it is advisable
to use more complex null models, see Hennig et al. (2015). In any case it
is important to note that a significant result of a homogeneity test does not necessarily
validate every single one of the found clusters. Homogeneity tests have been applied
to single clusters or pairs of clusters in order to give more local information
about grouping structure, but this is not without problems, see Hennig et al.
(2015).


5.4 Internal validation indices 

A large number of indices has been proposed in the literature for evaluating 
the quality of a clustering based on the clustered data alone. Such indices
are comprehensively discussed in Hennig et al. (2015). Most of them attempt
to summarize the clustering quality as a single number, which is somewhat
unsatisfactory according to the discussion in Section 2.

Alternatively it is possible to measure relevant aspects of a clustering sepa¬ 
rately in order to characterize the cluster quality in a multivariate way. Indices 
measuring several aspects of a clustering are implemented in the R-package 
“fpc”. Here are some examples: 

• measurements of within-cluster homogeneity such as maximum or average
within-cluster dissimilarity, within-cluster sum of squares, or the largest 
within-cluster gap; 


• measurements of cluster separation such as the minimum or average dissimilarity
between clusters; Hennig (2014) proposes the average minimum
dissimilarity to a point from a different cluster of the 10% of observations
for which this is smallest;


• measurements of fit such as within-cluster sum of dissimilarities from the
centroid or Hubert’s Γ-type measures, see Hennig et al. (2015);


• measurements of homogeneity of different clusters, e.g., the entropy of the 
cluster sizes or the coefficient of variation of cluster-wise average distances 
to the nearest neighbor; 


• measurements of similarity between the empirical within-cluster distribu¬ 
tion and distributional shapes of interest, such as the Gaussian or uniform 
distribution. 
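
Some of the aspects listed above can be computed directly from a dissimilarity matrix and a vector of cluster labels. The following sketch is not the implementation in “fpc”; the function `clustering_aspects` and the names of its outputs are hypothetical, and the data are simulated.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def clustering_aspects(D, labels):
    """Several separate aspects of a clustering, computed from the
    dissimilarity matrix D and the cluster labels."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)          # every pair once
    d = D[iu]
    same = (labels[:, None] == labels[None, :])[iu]
    sizes = np.unique(labels, return_counts=True)[1]
    p = sizes / sizes.sum()
    return {
        "average_within": float(d[same].mean()),    # homogeneity
        "max_within": float(d[same].max()),         # largest within-cluster dissimilarity
        "min_separation": float(d[~same].min()),    # separation
        "average_between": float(d[~same].mean()),
        "size_entropy": float(-(p * np.log(p)).sum()),  # uniformity of cluster sizes
    }

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
D = squareform(pdist(X))
good = np.repeat([0, 1], 30)   # labels following the two simulated groups
poor = rng.permutation(good)   # same cluster sizes, assigned at random

a_good = clustering_aspects(D, good)
a_poor = clustering_aspects(D, poor)
for key in a_good:
    print(f"{key:16s} good = {a_good[key]:7.3f}   poor = {a_poor[key]:7.3f}")
```

Keeping the aspects separate shows, for example, that the two labelings have identical cluster size entropy but differ strongly in homogeneity and separation.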


5.5 Stability assessment 

Stability is an important aspect of clustering quality. Certainly a clustering does 
not warrant a strong interpretation if it changes strongly under slight changes 
of the data. Although there is theoretical work on clustering stability (see
Hennig et al. (2015)), this gives very limited information about to what extent
a specific clustering on a specific dataset is stable. 

Given a dataset, stability can be explored by generating artificial variants 
of the data and exploring how much the clustering changes. This is treated
in Hennig et al. (2015). Standard resampling approaches are nonparametric
bootstrap, subsampling and splitting of the dataset. Alternatively, observations
may be “jittered” or additional observations such as outliers added, although 
the latter approaches require a model for adding or changing observations. 
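
A simple resampling scheme for exploring the stability of every single cluster can be sketched as follows. The assumptions are: average linkage cut at k clusters is used as clustering method, only the distinct points of each bootstrap sample are clustered (a simplification), and stability is summarized by the Jaccard similarity between an original cluster and its most similar cluster in the resampled clustering. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster(X, k):
    """Clustering method under study: average linkage cut at k clusters."""
    return fcluster(linkage(X, method="average"), t=k, criterion="maxclust")

def bootstrap_jaccard(X, k, B=50, seed=0):
    """Mean over B bootstrap samples of the best Jaccard similarity between
    each original cluster and the clusters found in the bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    base = cluster(X, k)
    ids = np.unique(base)
    scores = np.zeros((B, len(ids)))
    for b in range(B):
        idx = np.unique(rng.integers(0, n, n))  # distinct points of the bootstrap sample
        boot = cluster(X[idx], k)
        for j, c in enumerate(ids):
            orig = base[idx] == c               # original cluster within the sample
            if not orig.any():
                continue                        # cluster absent from this sample: score 0
            scores[b, j] = max(
                (orig & (boot == g)).sum() / (orig | (boot == g)).sum()
                for g in np.unique(boot)
            )
    return ids, scores.mean(axis=0)

rng = np.random.default_rng(6)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
X = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in centers])
ids, stability = bootstrap_jaccard(X, k=3)
for c, s in zip(ids, stability):
    print(f"cluster {c}: mean Jaccard similarity = {s:.2f}")
```

Values close to one indicate that a cluster is found again in almost all resampled datasets; the strongly separated simulated groups are an easy case chosen for illustration.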

Aspects to keep in mind are firstly that often parts of the dataset are clearly 
clustered and other parts are not, and therefore it may happen that some clusters 
of a clustering are stable and other parts are not. Secondly, stability is not 
enough to ensure the quality or meaningfulness of a clustering. For example, a
big enough dataset from a homogeneous distribution may allow a very stable
clustering: 2-means will partition data from a uniform distribution
on a two-dimensional rectangle in which one side is twice as long as the other 
in a very stable manner with only a few ambiguities along the borderline of 
the two clusters. Thirdly, in some applications in which data are clustered for 
organizational reasons such as information reduction, stability is not of much 
interest. 
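
The example of the uniform distribution on a rectangle can be checked by simulation. The sketch below uses a plain Lloyd-type 2-means written in NumPy (an assumption, any K-means implementation could be substituted), splits the data into two random halves, computes 2-means on each half, and compares the two resulting nearest centroid allocations of all points.

```python
import numpy as np

def assign(X, centers):
    """Nearest centroid allocation."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def two_means(X, n_init=5, n_iter=200, seed=0):
    """Lloyd's algorithm for K = 2, keeping the best of n_init random starts."""
    rng = np.random.default_rng(seed)
    best_sse, best_centers = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=2, replace=False)].copy()
        for _ in range(n_iter):
            labels = assign(X, centers)
            new = np.array([X[labels == j].mean(axis=0) for j in range(2)])
            if np.allclose(new, centers):
                break
            centers = new
        sse = ((X - centers[assign(X, centers)]) ** 2).sum()
        if sse < best_sse:
            best_sse, best_centers = sse, centers
    return best_centers

rng = np.random.default_rng(7)
# Homogeneous data: uniform on a rectangle with one side twice as long as the other.
X = rng.uniform([0.0, 0.0], [2.0, 1.0], size=(2000, 2))
perm = rng.permutation(len(X))
half_a, half_b = X[perm[:1000]], X[perm[1000:]]

lab_a = assign(X, two_means(half_a, seed=1))
lab_b = assign(X, two_means(half_b, seed=2))
# Proportion of points allocated in the same way, up to swapping the two labels.
agreement = max(np.mean(lab_a == lab_b), np.mean(lab_a != lab_b))
print(f"agreement between the two half-sample clusterings: {agreement:.3f}")
```

The high agreement reflects stability only; the data contain no clusters in any meaningful sense.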


5.6 Visual exploration


The term “cluster” has an intuitive visual meaning to most people, and also 
in the literature about cluster analysis visual displays are a major device to 
introduce and illustrate the clustering problem. Many of the potentially desired 
features of clusterings such as separation between clusters, high density within 
clusters, and distributional shapes can be explored graphically in a more holis¬ 
tic (if subjective) way than by looking at index values. Standard visualization 
techniques such as scatterplots, heatplots and mosaic plots for categorical data 
as well as interactive and dynamic graphics can be used both to find and to validate
clusters, see, e.g., Theus and Urbanek (2008), Cook and Swayne (1999).
For cluster validation, one would normally distinguish the clusters using different
colors and glyphs. Most people’s intuition for clusters is strongly connected
to the low-dimensional Euclidean space, and therefore methods that project
data into a low-dimensional Euclidean space such as PCA are popular and
useful. A chapter in Hennig et al. (2015) illustrates the use of PCA and a number
of other techniques for cluster visualization with a focus on network-based
techniques and visualization of curve clustering. There are also specialized projection
techniques for visualizing the separation between clusters in a given clustering
(Hennig (2004)) and for finding clusters (Bolton and Krzanowski (2003);
Tyler et al. (2009)). Hennig (2005) proposes to look for every single cluster at
plots that show its separation from the remainder of the dataset, as well as
projection pursuit plots for the data of a single cluster on its own to detect
deviations from homogeneity. Such plots can also be applied to more general
data formats if a dissimilarity measure exists by use of MDS. The implementation
of MDS in the “GGvis” package allows dynamic and interactive
exploration of the data and of the parameters of the MDS (Buja et al. (2008)).
Anderlucci and Hennig (2014) apply MDS to visualize clusters in categorical
data.

A number of visualization methods have been developed specifically for
clustering, of which dendrograms (see Hennig et al. (2015)) are probably most
widespread. Dendrograms are also frequently used for ordering observations in
heatplots. Due to their ability to visualize high-dimensional information and
dissimilarity matrices without projecting on a lower-dimensional space, heatplots
are often used for such data. Their use depends heavily on the order of
the observations. For use in cluster validation it is desirable to plot observations
in the same cluster together, which is achieved by the use of dendrograms for
ordering the observations. However, it would also be desirable to order observations
within clusters in such a way that the transition between clusters is as
smooth as possible, so that not well separated clusters can be detected. This is
treated by Hahsler and Hornik (2011).
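
Ordering the observations for a heatplot by a dendrogram can be sketched as follows, assuming SciPy's average linkage and simulated data. Only the reordered dissimilarity matrix is computed; passing it to a plotting routine is left out. The sketch does not implement the within-cluster ordering for smooth transitions mentioned above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(8)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
X = np.vstack([rng.normal(c, 1.0, size=(20, 2)) for c in centers])
X = X[rng.permutation(len(X))]  # shuffle, so that the input order is uninformative

d = pdist(X)
Z = linkage(d, method="average")
order = leaves_list(Z)                           # left-to-right order of the dendrogram leaves
D_ordered = squareform(d)[np.ix_(order, order)]  # matrix to be shown as a heatplot

# Clusters obtained by cutting the same dendrogram occupy contiguous blocks.
labels = fcluster(Z, t=3, criterion="maxclust")
switches = int(np.sum(np.diff(labels[order]) != 0))
print("label changes along the ordering:", switches)
```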

Kaufman and Rousseeuw (1990) introduced the silhouette plot based on the
silhouette width (see Hennig et al. (2015)), which shows how well observations
are separated from neighboring clusters. In Jornsten (2004) this is compared
with plots based on the within-cluster data depth. Leisch (2010) introduces
another alternative to the silhouette width based on centroids along with further
plots to explore how clusters are concentrated around cluster centroids.
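
The silhouette widths on which the silhouette plot is based can be computed from a dissimilarity matrix as in the following sketch. It uses the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of observation i to the other members of its own cluster and b(i) is the smallest average dissimilarity to the members of another cluster. Setting the width of singleton clusters to zero is a convention assumed here, and the data are simulated.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def silhouette_widths(D, labels):
    """Silhouette width of every observation, given a dissimilarity matrix D
    and cluster labels (at least two clusters are required)."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                              # singleton cluster: width 0
        a = D[i, own].sum() / (n_own - 1)         # D[i, i] = 0 is excluded by n_own - 1
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
labels = np.repeat([0, 1], 30)
s = silhouette_widths(squareform(pdist(X)), labels)

# A silhouette plot shows these values as bars, sorted within each cluster.
for c in np.unique(labels):
    print(f"cluster {c}: average silhouette width = {s[labels == c].mean():.2f}")
```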


5.7 Different clusterings on the same dataset 


The similarity between different clusterings on the same dataset can be measured
using the ideas in the corresponding chapter of Hennig et al. (2015). Running
different cluster analyses on the same dataset and analyzing to what extent the 
results differ can be seen as an alternative approach to find out whether and 
which clusters in the dataset are stable and meaningful. Some care is required 
regarding the choice of clustering methods and the interpretation of results. If 
certain characteristics of a clustering are important in a certain application and 
others are not, it is more important that the chosen cluster analysis method 
delivers a good result in this respect than that its results coincide largely with 
the results of a less appropriate method. So if methods are chosen that are too 
different from each other, some of them may just be inappropriate for the given 
problem and no importance should be attached to their results. On the other 
hand, if too similar methods are chosen (such as Ward’s method and K-means),
the fact that clusters are similar does not tell the user too much about their 
quality. Looking at the similarity of different clusterings on the same data is 
useful mainly for two reasons: 


• Several different methods may seem appropriate for the clustering aim, 
either because the aim is imprecise, or because heterogeneous and poten¬ 
tially conflicting characteristics of the clustering are desired. 

• Some fine-tuning is required (such as neighborhood sizes in density-based 
clustering, variable weighting in the dissimilarity, or prior specifications 
in Bayesian clustering), and it is of interest to explore how sensitive the 
clustering solution is to such tuning, particularly because the precise values 
of tuning constants are hardly fully determined by background knowledge. 


6 Conclusions 


In this paper, the decisions required for carrying out a cluster analysis are dis¬ 
cussed, connecting them closely to the clustering aims in a specific application. 
The paper is intended to serve as a general guideline for clustering and for 
choosing the appropriate methodology from the many approaches on offer in
Hennig et al. (2015).